class: center, middle, inverse, title-slide

.title[
# FIN7028: Time Series Financial Econometrics 7
]
.subtitle[
## Rethinking financial prediction
]
.author[
### Barry Quinn
]
.date[
### 2023-02-23
]

---
layout: true

<div class="my-footer">
<span>
Barry Quinn CStat
</span>
</div>

---

.acid[
.hand[Learning Outcomes]
- Simple (but useful) forecasting techniques
- Internal validation
- External validation
- Prediction uncertainty
- Model complexity and sample size
]

---
class: middle

## Rethinking forecasting

* Terminology in econometrics can sometimes be confusing.
* The terms **forecasting** and **prediction** are used interchangeably.
* **Prediction** can sometimes refer to the in-sample predictions from the estimated model.
* In this course we will refer to these as **fitted values** or **retrodictions**.
* In this course forecasting and prediction will mean: **determining the value that a series is likely to take**.

---
class: middle

## Some simple forecasting methods

<!-- -->

---
class: middle

# How would you forecast these data?

.pull-left[
#### Average method

* Forecast of all future values is equal to the mean of the historical data `\(\{y_1,\dots,y_T\}\)`.
* Forecasts: `\(\hat{y}_{T+h|T} = \bar{y} = (y_1+\dots+y_T)/T\)`

#### Naïve method

* Forecasts equal to last observed value.
* Forecasts: `\(\hat{y}_{T+h|T} =y_T\)`.
* Consistent with the efficient market hypothesis.
]
.pull-right[
#### Seasonal naïve method

* Forecasts equal to last value from same season.
* Forecasts: `\(\hat{y}_{T+h|T} =y_{T+h-m(k+1)}\)`, where `\(m=\)` seasonal period and `\(k\)` is the integer part of `\((h-1)/m\)`.

#### Drift method

* Forecasts equal to last value plus average change.
* Forecasts:
`$$\hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1}\sum_{t=2}^T (y_t-y_{t-1})$$`
`$$\hat{y}_{T+h|T} = y_T + \frac{h}{T-1}(y_T -y_1)$$`
* Equivalent to extrapolating a line drawn between the first and last observations.
]

---
class: middle

## Some simple forecasting methods

.pull-left-2[
<!-- -->
]
.pull-right-1[
## Some simple forecasting methods

* Mean: `meanf(y, h=24)`
* Naïve: `naive(y, h=24)`
* Seasonal naïve: `snaive(y, h=24)`
* Drift: `rwf(y, drift=TRUE, h=24)`
]

---
class: middle

.your-turn[
* Use these four functions to produce forecasts for `ni_hsales_ts` and `glen_m_ts`.
* Plot the results using `autoplot()`.
]

---
class: middle

## Rethinking econometrics

* Adjustments or transformations of the historical data can lead to a simpler forecasting task.
* They simplify data patterns by:
  1. removing known sources of variation.
  2. making the pattern more consistent across the whole data set.
* Simpler patterns usually lead to more accurate forecasts.

---
class: middle

.pull-left[
## Rethinking econometrics

* For explaining tasks, data transformations simplify patterns but may also manufacture overconfidence.
* Frequently, scholars pre-average some data to construct variables for regression analysis.
* Averaging can be dangerous as it removes variation.
* One solution in explaining tasks is to use multilevel models, which preserve uncertainty in the original, pre-averaged values, while still using the average to make predictions.
]
.pull-right[
## Inflation adjustments

* Data which are affected by the value of money are best adjusted before modelling.
* Financial time series are usually adjusted so that all values are stated in dollar values from a particular year.
* The adjustment is made using a price index.
* If `\(z_t\)` denotes the UK consumer price index and `\(y_t\)` denotes the nominal value of the QSMF in month `\(t\)` then `\(x_t=y_t/z_t \times z_{\text{May 2016}}\)` gives the adjusted (real) QSMF value at May 2016 prices.
]

---
class: middle

# Residual diagnostics

.pull-left[
## Fitted values (retrodictions)

- `\(\hat{y}_{t|t-1}\)` is the forecast of `\(y_t\)` based on observations `\(y_1,\dots,y_{t-1}\)`.
- We call these "fitted values"; they always involve a one-step-ahead forecast.
- We sometimes drop the subscript: `\(\hat{y}_t \equiv \hat{y}_{t|t-1}\)`.
- Often not true forecasts since parameters are estimated on all data.
- In terms of terminology, calling these *retrodictions* is more meaningful.
]
.pull-right[
### For example:

- `\(\hat{y}_{t} = \bar{y}\)` for average method.
- `\(\hat{y}_{t} = y_{t-1} + (y_{T}-y_1)/(T-1)\)` for drift method.
]

---
class: middle

.blockquote.large[
**Residuals in forecasting:** difference between observed value and its fitted value: `\(e_t = y_t-\hat{y}_{t|t-1}\)`.
]

## Assumptions

1. `\(\{e_t\}\)` uncorrelated. If they aren't, then information left in residuals that should be used in computing forecasts.
2. `\(\{e_t\}\)` have mean zero. If they don't, then forecasts are biased.

## Useful properties (for prediction intervals)

3. `\(\{e_t\}\)` have constant variance.
4. `\(\{e_t\}\)` are normally distributed.

---
class: middle

## Example: FTSE index price

.panelset[
.panel[
.panel-name[plot]
<img src="data:image/png;base64,#index_files/figure-html/ftse1-1.png" width="50%" />
]
.panel[
.panel-name[Naive Forecast]
`$$\hat{y}_{t|t-1}= y_{t-1}$$`
`$$e_t = y_t-y_{t-1}$$`

>Note: `\(e_t\)` are one-step-forecast residuals
]
.panel[
.panel-name[Code + Output]
<img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-2-1.png" width="50%" />
]
.panel[
.panel-name[Diagnostics]
<img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-3-1.png" width="50%" />
]
.panel[
.panel-name[Diagnostic]
<img src="data:image/png;base64,#index_files/figure-html/ftse4-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[
.panel-name[ACF]
<img src="data:image/png;base64,#index_files/figure-html/ftse5-1.png" width="50%" />
]
]

---
class: middle

## ACF of residuals

* We assume that the residuals are white noise (uncorrelated, mean zero, constant variance).
If they aren't, then there is information left in the residuals that should be used in computing forecasts. * So a standard residual diagnostic is to check the ACF of the residuals of a forecasting method. * We *expect* these to look like white noise. --- class: middle ## `checkresiduals` function ```r checkresiduals(naive(ftse_m_ts)) ``` <!-- --> ``` ## ## Ljung-Box test ## ## data: Residuals from Naive method ## Q* = 7.0892, df = 12, p-value = 0.8517 ## ## Model df: 0. Total lags used: 12 ``` --- class: inverse, center .salt[Evaluating forecast accuracy] --- class: middle ## Training and test sets <!-- --> - A model which fits the training data well will not necessarily forecast well. - A perfect fit can always be obtained by using a model with enough parameters. - Over-fitting a model to data is just as bad as failing to identify a systematic pattern in the data. * The test set must not be used for *any* aspect of model development or calculation of forecasts. * Forecast accuracy is based only on the test set. --- class: middle ### Forecast errors Forecast "error": the difference between an observed value and its forecast. `$$e_{T+h} = y_{T+h} - \hat{y}_{T+h|T},$$` where the training data is given by `\(\{y_1,\dots,y_T\}\)` - Unlike residuals, forecast errors on the test set involve multi-step forecasts. - These are *true* forecast errors as the test data is not used in computing `\(\hat{y}_{T+h|T}\)`. 
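---
class: middle

## Simple methods in base R

The benchmark methods and the train/test forecast errors above can be sketched in a few lines of base R, without the `forecast` package. This is a minimal illustration on a made-up numeric series; the function names are mine, not from any package.

```r
# Hypothetical helpers implementing the benchmark forecasts defined earlier.
mean_forecast  <- function(y, h) rep(mean(y), h)       # average method
naive_forecast <- function(y, h) rep(y[length(y)], h)  # naive method
drift_forecast <- function(y, h) {                     # drift method
  T <- length(y)
  y[T] + (1:h) * (y[T] - y[1]) / (T - 1)  # last value plus average change
}

# Split a toy series into training and test sets, then compute the
# multi-step forecast errors e_{T+h} = y_{T+h} - yhat_{T+h|T}.
y <- c(10, 12, 11, 13, 15, 14, 16, 18)
train <- y[1:6]
test  <- y[7:8]
e_naive <- test - naive_forecast(train, h = 2)  # c(2, 4)
e_drift <- test - drift_forecast(train, h = 2)  # c(1.2, 2.4)
```

Because the test observations play no part in computing the forecasts, these are true forecast errors rather than residuals.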
---
class: middle

## Measures of forecast accuracy

<img src="data:image/png;base64,#index_files/figure-html/returnsaccuracy-1.png" style="display: block; margin: auto;" />

---
class: middle

## Measures of forecast accuracy

- `\(y_{T+h}\)` is the `\((T+h)\)`th observation, `\(h=1,\dots,H\)`
- `\(\hat{y}_{T+h|T}\)` is the forecast based on data up to time `\(T\)`
- `\(e_{T+h} = y_{T+h} - \hat{y}_{T+h|T}\)`

`$$\text{MAE} = \text{mean}(|e_{T+h}|)$$`
`$$\text{MSE} = \text{mean}(e_{T+h}^2)$$`
`$$\text{RMSE} = \sqrt{\text{mean}(e_{T+h}^2)}$$`
`$$\text{MAPE} = 100\text{mean}(|e_{T+h}|/ |y_{T+h}|)$$`

* MAE, MSE, RMSE are all scale dependent.
* MAPE is scale independent but is only sensible if `\(y_t\gg 0\)` for all `\(t\)`, and `\(y\)` has a natural zero.

---
class: middle

## Measures of forecast accuracy

.blockquote[
.large[Mean Absolute Scaled Error]

$$ \text{MASE} = \text{mean}(|e_{T+h}|/Q) $$

where `\(Q\)` is a stable measure of the scale of the time series `\(\{y_t\}\)`.
<br>
.hand[Proposed by Hyndman and Koehler (IJF, 2006).]

- For non-seasonal time series,
`$$Q = (T-1)^{-1}\sum_{t=2}^T |y_t-y_{t-1}|$$`
works well. Then MASE is equivalent to MAE relative to a naïve method.
] --- class: middle .panelset[ .panel[ .panel-name[Glencore example] <!-- --> ] .panel[ .panel-name[Statistical accuracy] <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> RMSE </th> <th style="text-align:right;"> MAE </th> <th style="text-align:right;"> MAPE </th> <th style="text-align:right;"> MASE </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mean method </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.1 </td> <td style="text-align:right;"> 100.37 </td> <td style="text-align:right;"> 0.68 </td> </tr> <tr> <td style="text-align:left;"> Naïve method </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.1 </td> <td style="text-align:right;"> 99.53 </td> <td style="text-align:right;"> 0.67 </td> </tr> <tr> <td style="text-align:left;"> Seasonal naïve method </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.1 </td> <td style="text-align:right;"> 100.37 </td> <td style="text-align:right;"> 0.68 </td> </tr> </tbody> </table> ] ] --- class: middle ## External validity .hand-large[cross-validation] .panelset[ .panel[ .panel-name[traditional] <!-- --> ] .panel[ .panel-name[Time-series] <!-- --> ] ] --- class: middle .salt[ **Explanation** * Forecast accuracy averaged over test sets. 
* Also known as "evaluation on a rolling forecasting origin"
]

---
class: middle

## tsCV function:

* The following compares RMSE obtained via time series cross-validation with the residual RMSE

```r
glen_m_ts1 <- window(glen_m_r, end=c(2017,12))
# one-step ahead forecast errors for drift method
e <- tsCV(glen_m_r, rwf, drift=TRUE, h=1)
# RMSE of forecast errors
sqrt(mean(e^2, na.rm=TRUE))
```

```
## [1] 0.1734816
```

```r
# In-sample residuals of drift method
sqrt(mean(residuals(rwf(glen_m_ts1,drift=TRUE))^2, na.rm=TRUE))
```

```
## [1] 0.1697185
```

---
class: middle

.pull-left[
.large[`tsCV` function inference]

* As expected, the RMSE from the residuals is smaller, as the corresponding *retrodictions* are based on a model fitted to the entire data set.
* **They are not true forecasts**
* A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation.
]
.pull-right[
* Using `ftse_m_ts`, the code below evaluates the forecasting performance of 1 to 10 steps ahead naive forecasts with `tsCV`, using MSE as the forecast error measure.

<img src="data:image/png;base64,#index_files/figure-html/stepsheaderror-1.png" width="50%" style="display: block; margin: auto;" />

* As you would expect, the forecast error increases as the forecast horizon increases.
]

---
class: middle

## Prediction intervals

* A forecast `\(\hat{y}_{T+h|T}\)` is (usually) the mean of the conditional distribution `\(y_{T+h} \mid y_1, \dots, y_{T}\)`.
* A prediction interval gives a region within which we expect `\(y_{T+h}\)` to lie with a specified probability.
* Assuming forecast errors are normally distributed, then a 95% PI is
`$$\hat{y}_{T+h|T} \pm 1.96 \hat\sigma_h$$`
- where `\(\hat\sigma_h\)` is the standard deviation of the `\(h\)`-step distribution.
- When `\(h=1\)`, `\(\hat\sigma_h\)` can be estimated from the residuals.
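---
class: middle

## Rolling forecasting origin by hand

The rolling-origin evaluation that `tsCV()` automates can be hand-rolled in a few lines of base R. The function name and toy series below are mine; `forecast_fun` is any function taking a training series and a horizon.

```r
# For each origin t, fit only on y[1:t] and record the h-step-ahead error.
rolling_origin_errors <- function(y, forecast_fun, h = 1) {
  T <- length(y)
  sapply(seq_len(T - h), function(t) {
    yhat <- forecast_fun(y[1:t], h)  # forecast uses data up to time t only
    y[t + h] - yhat[h]               # true forecast error at t + h
  })
}

naive_fc <- function(y, h) rep(y[length(y)], h)

y <- c(5, 7, 6, 8, 9, 11, 10)
e <- rolling_origin_errors(y, naive_fc, h = 1)
rmse_cv <- sqrt(mean(e^2))  # cross-validated RMSE for the naive method
```

Because each forecast is refitted on data up to its origin only, the resulting errors are true forecast errors, unlike residuals.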
--- class: middle ## Prediction intervals **Naive forecast with prediction interval:** ```r glen_m_ts1 %>% rwf %>% residuals -> res res_sd <- sqrt(mean(res^2, na.rm=TRUE)) c(tail(glen_m_ts1,1)) + 1.96 * res_sd * c(-1,1) ``` ``` ## [1] -0.1942871 0.4710436 ``` ```r rwf(glen_m_ts1, level=95) ``` ``` ## Point Forecast Lo 95 Hi 95 ## Jan 2018 0.1383782 -0.1942810 0.4710375 ## Feb 2018 0.1383782 -0.3320730 0.6088295 ## Mar 2018 0.1383782 -0.4378045 0.7145609 ## Apr 2018 0.1383782 -0.5269403 0.8036967 ## May 2018 0.1383782 -0.6054705 0.8822269 ## Jun 2018 0.1383782 -0.6764672 0.9532236 ## Jul 2018 0.1383782 -0.7417554 1.0185119 ## Aug 2018 0.1383782 -0.8025242 1.0792807 ## Sep 2018 0.1383782 -0.8595995 1.1363560 ## Oct 2018 0.1383782 -0.9135827 1.1903391 ``` --- class: middle ## Prediction intervals * Point forecasts are often useless without prediction intervals. * Prediction intervals require a stochastic model (with random errors, etc). * Multi-step forecasts for time series require a more sophisticated approach (with PI getting wider as the forecast horizon increases). Assume residuals are normal, uncorrelated, sd = `\(\hat\sigma\)`: ||| |:--:|:--:| | Mean forecasts: | `\(\hat\sigma_h = \hat\sigma\sqrt{1 + 1/T}\)` | |Naïve forecasts: | `\(\hat\sigma_h = \hat\sigma\sqrt{h}\)`| |Seasonal naïve forecasts | `\(\hat\sigma_h = \hat\sigma\sqrt{k+1}\)`| | Drift forecasts: | `\(\hat\sigma_h = \hat\sigma\sqrt{h(1+h/T)}\)`| - where `\(k\)` is the integer part of `\((h-1)/m\)`. - Note that when `\(h=1\)` and `\(T\)` is large, these all give the same approximate value `\(\hat\sigma\)`. --- class: middle ## Prediction intervals * Computed automatically using: `naive()`, `snaive()`, `rwf()`, `meanf()`, etc. * Use `level` argument to control coverage. * Check residual assumptions before believing them. * Usually too narrow due to unaccounted uncertainty. 
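---
class: middle

## Prediction interval widths in code

The table of `\(h\)`-step standard deviations above translates directly into code. A minimal sketch (the function is illustrative, not from any package):

```r
# h-step forecast sd for each benchmark method, given residual sd `sigma`,
# sample size `T`, and seasonal period `m`.
sigma_h <- function(method, sigma, h, T, m = 12) {
  k <- floor((h - 1) / m)  # integer part of (h-1)/m
  switch(method,
         mean   = sigma * sqrt(1 + 1 / T),
         naive  = sigma * sqrt(h),
         snaive = sigma * sqrt(k + 1),
         drift  = sigma * sqrt(h * (1 + h / T)))
}

# A 95% naive prediction interval at horizon h widens with sqrt(h):
naive_pi95 <- function(yT, sigma, h, T) {
  yT + c(-1, 1) * 1.96 * sigma_h("naive", sigma, h, T)
}
```

Note how, for `\(h=1\)` and large `\(T\)`, all four formulas collapse to approximately `\(\hat\sigma\)`.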
---
class: middle

## Rethinking prediction uncertainty using Bayesian forecasting

<!-- Taken from Tsay (2010) Analysis of Financial Time Series 3rd Edition page 55 -->

* In practice, estimated parameters are often used to compute point and interval forecasts.
* This results in a **conditional forecast** because such a forecast does not take into consideration the uncertainty in the parameter estimates.
* In theory, one can consider parameter uncertainty in forecasting, but it is much more involved.
* A natural way to consider parameter and model uncertainty in forecasting is Bayesian forecasting with Markov chain Monte Carlo (MCMC) methods.

---
class: middle

.salt[
# Bayesian time series forecasting

- `bayesforecast` fits Bayesian time series using [*Stan*](https://mc-stan.org).
- Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation.
]

---
class: middle

.panelset[
.panel[
.panel-name[Daily FTSE]
<img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-4-1.png" width="50%" />
]
.panel[
.panel-name[Bayesian naive forecast]

```
## 
## SAMPLING FOR MODEL 'Sarima' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 6.8e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.68 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: ## Chain 1: ## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 1: ## Chain 1: Elapsed Time: 0.507382 seconds (Warm-up) ## Chain 1: 0.336278 seconds (Sampling) ## Chain 1: 0.84366 seconds (Total) ## Chain 1: ## ## SAMPLING FOR MODEL 'Sarima' NOW (CHAIN 2). ## Chain 2: ## Chain 2: Gradient evaluation took 4.6e-05 seconds ## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.46 seconds. ## Chain 2: Adjust your expectations accordingly! ## Chain 2: ## Chain 2: ## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 2: ## Chain 2: Elapsed Time: 0.482005 seconds (Warm-up) ## Chain 2: 0.314639 seconds (Sampling) ## Chain 2: 0.796644 seconds (Total) ## Chain 2: ## ## SAMPLING FOR MODEL 'Sarima' NOW (CHAIN 3). 
## Chain 3: ## Chain 3: Gradient evaluation took 4.8e-05 seconds ## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.48 seconds. ## Chain 3: Adjust your expectations accordingly! ## Chain 3: ## Chain 3: ## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 3: ## Chain 3: Elapsed Time: 0.478392 seconds (Warm-up) ## Chain 3: 0.284917 seconds (Sampling) ## Chain 3: 0.763309 seconds (Total) ## Chain 3: ## ## SAMPLING FOR MODEL 'Sarima' NOW (CHAIN 4). ## Chain 4: ## Chain 4: Gradient evaluation took 4.5e-05 seconds ## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.45 seconds. ## Chain 4: Adjust your expectations accordingly! 
## Chain 4: ## Chain 4: ## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 4: ## Chain 4: Elapsed Time: 0.470443 seconds (Warm-up) ## Chain 4: 0.309917 seconds (Sampling) ## Chain 4: 0.78036 seconds (Total) ## Chain 4: ``` ] .panel[ .panel-name[Check the model simulations] <!-- --> ] .panel[ .panel-name[Check residuals] <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-7-1.png" width="50%" /> ] .panel[ .panel-name[Probabilistic forecasts] - Probabilistic forecasts for the next 100 days <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-8-1.png" width="50%" /> ] ] --- class: middle ## Using the state-of-the-art `Prophet` algorithm - https://facebook.github.io/prophet/ .blockquote.large[ Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.] 
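---
class: middle

## A typical `prophet` workflow in R

As a hedged sketch of how such forecasts are typically produced with the `prophet` R package (assuming it is installed): Prophet requires a data frame whose columns are named `ds` (dates) and `y` (values). The simulated series below is a stand-in, not the FTSE data used on the next slide.

```r
library(prophet)

# Illustrative stand-in data: a random walk with a daily date index.
set.seed(1)
df <- data.frame(ds = seq.Date(as.Date("2020-01-01"), by = "day", length.out = 500),
                 y  = cumsum(rnorm(500)))

m <- prophet(df)                                  # fit the additive model

# Extend the date index beyond the training sample...
future <- make_future_dataframe(m, periods = 100)

# ...and produce probabilistic forecasts (yhat, yhat_lower, yhat_upper)
forecast <- predict(m, future)
plot(m, forecast)
```

The uncertainty columns `yhat_lower` and `yhat_upper` give the prediction interval that the default plot shades.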
---
class: middle

## `Prophet` forecasts of the FTSE

.panelset[
.panel[
.panel-name[Build Prophet Model]

```
##             ds
## 900 2024-02-17
## 901 2024-02-18
## 902 2024-02-19
## 903 2024-02-20
## 904 2024-02-21
## 905 2024-02-22
```
]
.panel[
.panel-name[Predict using model]
]
.panel[
.panel-name[plot forecasts]
<img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-11-1.png" width="60%" />
]
.panel[
.panel-name[Inference]
- The default model in the `Prophet` algorithm is linear and additive.
- While this is a *state-of-the-art* automated probabilistic forecasting technique, it performs poorly given our domain knowledge of the ebbs and flows of financial markets.
- .fatinline[With great power comes great responsibility]
- Fine-tuning the algorithm is probably required if you are to use this in your project.
- Compared to the naïve model, is it an improvement?
]
]

---
class: middle

# Rethinking regression assumptions

.saltinline[
|Assumption | Importance |
|:---:|:---:|
|Validity | Most important|
|Linear and additive | `\(\downarrow\)`|
|Independence of errors | `\(\downarrow\)`|
|Equality of variance | `\(\downarrow\)`|
|Normality of errors | Least important|
]

* Further assumptions are required if regression coefficients are to be given a causal interpretation; in general it is important to check that there is no **endogeneity** present in the model.

---
class: middle

## Validity

* The data you are analyzing should map to the research question you are trying to answer.
* This is obvious but is frequently ignored due to inconvenience.
* Optimally this means that the outcome measure should accurately reflect the phenomenon of interest.
* Choosing predictor variables is generally the most challenging step.
* Optimally all *relevant* predictors should be included, but it can be difficult to determine which are necessary and how to interpret coefficients with large standard errors.
* Finally, a representative sample that reflects the true distribution of the underlying population is vital to make generalized inferences.

---
class: middle

## Additivity and linearity

* The most important mathematical assumption of a regression model is that its deterministic component is a linear function of the separate predictors:
`$$y_t \sim N(\mu_t,\sigma^2)$$`
where
`$$\mu_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t}$$`
* If additivity and linearity are violated it might make sense to transform the data.

---
class: middle

## Additivity and linearity

* Consider `\(y= x_1^{\beta_1} \times x_2^{\beta_2} \times x_3^{\beta_3}\)`, where `\(y\)` is a multiplicative and non-linear function of the predictors.
* By taking logs of both sides we induce linearity and additivity: `\(\ln(y)=\beta_1\ln(x_1)+\beta_2\ln(x_2)+\beta_3\ln(x_3)\)`
* It is important to note that the betas are now interpreted differently (as elasticities)!
* When two predictors are suspected of having a multiplicative influence an **interaction term** can be used.

---
class: middle

## Other assumptions

* **Independence of errors**: The classical linear regression model assumes that model errors are independent.
* **Equal variance of errors**: If the variances of the regression errors are unequal, estimation is more efficiently performed using weighted least squares, where each point is weighted inversely to its variance.
* Unequal variance doesn’t affect the most important aspect of a regression model, which is the form of the predictors.
* **Normality of errors**: This is generally the least important and for the purpose of estimating (training) the regression line (as compared to predicting individual data points) the assumption of normality is barely important at all.

---
class: middle

# Selecting predictors and forecast evaluation

* When there are many predictors, how should we choose which ones to use?
* We need a way of comparing two competing models (*or narrowing down our choice*).

**What not to do!**

* Plot `\(y\)` against a particular predictor (`\(x_j\)`) and if it shows no noticeable relationship, drop it.
* Do a multiple linear regression on all the predictors and disregard all variables whose `\(p\)`-values are greater than 0.05.
* Maximize `\(R^2\)` or minimize MSE.

---
class: middle

# Rethinking: model comparison and selection

## Model checking

* Every model is a merger of **sense** and **nonsense**.
* When we understand a model, we find its sense and control its nonsense.
* Complex models should not be viewed with awe but with **informed** suspicion.
* This intellectual discipline comes with breaking down the model into its components and checking its validity.

---
class: middle

## Model comparison and selection

* As modelers of financial time series phenomena we are confronted with a paradox:
* Finance is an empirical science based on empirical facts.
* Data are scarce, and many theories and models fit the same data.
* How do we therefore set up a null model and use data to falsify it?

---
class: middle

## Model comparison and selection

* As a result of the scarcity of financial data, many statistical models, even simple ones, can be compatible with the same data with roughly the same level of significance.
* For example, the stock price process has been described by many competing statistical models, including the *random walk* we encountered earlier.
* See Timmermann, A. (2008). Elusive return predictability. International Journal of Forecasting, 24, 1–18 in the week 5 reading.
* In this paper, 11 competing forecasting models are fitted with varying success.

---
class: middle

## Model complexity and sample size

<!-- --><!-- -->

---
class: middle

## Model complexity and sample size

* Machine learning (ML) in financial modeling has gained in popularity as a consequence of the diffusion of low-cost high-performance computing.
* ML uses a family of highly flexible models that can approximate sample data with unlimited precision.
* For example, neural networks (*deep learning*), with an unrestricted number of layers and nodes, can approximate any function with arbitrary precision.
* In mathematics, they are known as *universal function approximators*.
* Some *machine learning* appears in most financial econometric endeavours.

---
class: middle

## Model complexity and sample size

* In practice, representing sample data with high precision results in poor forecasting performance.
* Financial data features have both a structural and a noise component.
* A high precision model will try to exactly fit the structural part of the data (in-sample) but will also try to match the unpredictable noise.
* Recall this phenomenon is called **overfitting**.

---
class: middle

## Model complexity and sample size

* Machine learning theory provides some criteria to constrain the complexity of the model so that it fits the data only partially but, as a trade-off, retains some forecasting power.
* **Information criteria statistics**
* The key intuition is that:
* **The structure of the data and the sample size dictate the complexity of the laws that can be learned by computer algorithms**.
* This is achieved using a penalty function which is itself a function of sample size and model complexity.

---
class: middle

## Model complexity and sample size

* This learning theory constrains model dimensionality so that models adapt to the sample size and structure.
* The penalty term usually increases with the number of parameters but gets smaller with sample size.
* The point is that if we have only a small sample data set, we can only learn simple patterns, provided those patterns exist.

>MODEL COMPLEXITY vs FORECASTING ABILITY

---
class: middle

## Model complexity and sample size

.pull-left[
* At the other end is the **theoretical approach** to model selection, typical in the physical sciences, which is based on human creativity.
* Models are the result of new scientific insights that have been embodied in theories.
* A well-known example in finance is the CAPM.
]
.pull-right[
* In modern computer-based financial econometrics a hybrid approach, mixing both theoretical and machine learning elements, is common.

1. The theoretical foundation identifies a family of models.
2. The learning approach chooses the *correct* model(s) within the family.

* For example, the ARCH/GARCH family of models was suggested by theory but is selected via machine learning techniques.
]
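---
class: middle

## Overfitting in miniature

The complexity trade-off above can be demonstrated in a few lines of base R: a flexible polynomial trend fits the training sample better than a straight line, but that in-sample gain does not carry over to new data. The simulated series is illustrative only.

```r
set.seed(1)
# Simulated series: linear structure plus noise
d <- data.frame(t = 1:30)
d$y <- 0.1 * d$t + rnorm(30)
train <- d[1:24, ]
test  <- d[25:30, ]

rmse <- function(e) sqrt(mean(e^2))

# In-sample vs out-of-sample RMSE for polynomial trends of degree 1 and 6
out <- sapply(c(1, 6), function(deg) {
  fit <- lm(y ~ poly(t, deg), data = train)
  c(in_sample  = rmse(residuals(fit)),
    out_sample = rmse(test$y - predict(fit, newdata = test)))
})
colnames(out) <- c("degree 1", "degree 6")
```

The degree-6 model always achieves the lower in-sample RMSE; with most random seeds its out-of-sample RMSE is far worse, because it has fitted noise that does not repeat out of sample.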